Conversation

@ishandhanani (Contributor) commented Jun 24, 2025

This PR is a revamp of #1565 based on this comment by @rmccorm4.

Description

The OpenAI Completions API supports the following prompt inputs:

pub enum Prompt {
    String(String),
    StringArray(Vec<String>),
    // Minimum value is 0, maximum value is 50256 (inclusive).
    IntegerArray(Vec<u16>),
    ArrayOfIntegerArray(Vec<Vec<u16>>),
}

This PR adds batch-style support for StringArray and ArrayOfIntegerArray, and extends the same support that String (the default) already has to IntegerArray.

The sglang_inc.py engine has been updated to demonstrate this.

Approach

To minimize any performance hit, I first match on the input type. If we see a token-style input, we cast the IDs to u32 and construct the request directly. If not, we move forward with tokenization as expected.
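A minimal sketch of that dispatch, assuming the Prompt enum above; the TokenInput enum mirrors the one described in the review summary further down, but the helper name and body here are illustrative rather than the PR's exact code:

enum TokenInput {
    Single(Vec<u32>),
    Batch(Vec<Vec<u32>>),
}

// Token-style prompts skip tokenization entirely; u16 IDs are widened to u32.
fn try_extract_tokens(prompt: &Prompt) -> Option<TokenInput> {
    match prompt {
        Prompt::IntegerArray(ids) => Some(TokenInput::Single(
            ids.iter().map(|&id| u32::from(id)).collect(),
        )),
        Prompt::ArrayOfIntegerArray(batches) => Some(TokenInput::Batch(
            batches
                .iter()
                .map(|ids| ids.iter().map(|&id| u32::from(id)).collect())
                .collect(),
        )),
        // String and StringArray fall through to the usual tokenization path.
        _ => None,
    }
}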

Tests with all 4 input types

Script to tokenize text
from transformers import AutoTokenizer

# 1. Choose the model ID
model_name = "Qwen/Qwen2.5-7B"

# 2. Load the tokenizer
#    (the Qwen2.5 tokenizer is bundled in the model repo on HF)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3. Your text
text = "A large language model is a"

# 4. Encode to get token IDs (omit special tokens if you just want raw subwords)
encoding = tokenizer(text, add_special_tokens=False)

# 5. Inspect the results
print("Token IDs: ", encoding["input_ids"])

Using the following for testing:

  • Token IDs: [32, 3460, 4128, 1614, 374, 264]
  • Text: A large language model is a
  • Model: Qwen/Qwen2.5-7B

IntegerArray

input

curl localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B",
    "prompt": [32, 3460, 4128, 1614, 374, 264],
    "stream": false,
    "max_tokens": 30
  }'

output

{
  "id": "cmpl-3f3f10ca-320d-4740-b4c6-d044436a0655",
  "choices": [
    {
      "text": " machine learning tool conceived by Google last August.\n\nImagine putting Strings into an ocean, taking a dip and then catching one for a meal.\n\n\nReth",
      "index": 0,
      "finish_reason": null
    }
  ],
  "created": 1750819021,
  "model": "Qwen/Qwen2.5-7B",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 29,
    "total_tokens": 0,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

ArrayOfIntegerArray

input

curl localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B",
    "prompt": [[32, 3460, 4128, 1614, 374, 264], [32, 3460, 4128, 1614, 374, 264]],
    "stream": false,
    "max_tokens": 30
  }'

output

{
  "id": "cmpl-50006ebd-b759-41fc-9d5f-335651b1910d",
  "choices": [
    {
      "text": " type of artificial intelligence designed to carry out human-like, one-sided conversations. Influenced by developments in the NLP/NLU “Big Bang”,",
      "index": 0,
      "finish_reason": null
    },
    {
      "text": " computer model. But that a shorthand. Apparently confusing. A large new language model is both big and notoriously vague. But stay tuned.— Casey C",
      "index": 1,
      "finish_reason": null
    }
  ],
  "created": 1750819161,
  "model": "Qwen/Qwen2.5-7B",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 58,
    "total_tokens": 0,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

String (with "nvext": {"use_raw_prompt": true})

input

curl localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B",
    "prompt": "A large language model is a",
    "stream": false,
    "max_tokens": 30,
    "nvext": {"use_raw_prompt":true}
  }'
output

{
  "id": "cmpl-c2f7ef91-dff9-4a19-b3fa-75b41779f5f4",
  "choices": [
    {
      "text": " complex mathematical system that can be used to solve a wide variety of problems. The model consists of a sequence of tasks, each of which is solved",
      "index": 0,
      "finish_reason": null
    }
  ],
  "created": 1750819802,
  "model": "Qwen/Qwen2.5-7B",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 29,
    "total_tokens": 0,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

StringArray

input

curl localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B",
    "prompt": ["A large language model is a", "A large language model is a"],
    "stream": false,
    "max_tokens": 30
  }'

output

{
  "id": "cmpl-a51a4d40-9e82-4aec-80c8-140073ecd0fd",
  "choices": [
    {
      "text": " model trained on a diverse dataset consisting of text, images, audio, and natural language processing data. These models allow developers to take feedback from humans",
      "index": 0,
      "finish_reason": null
    },
    {
      "text": " math-informed system which can be used for predicting choices along with processing output for respective ones. Summarizing Deep Learning Books Artificial Intelligence is considered",
      "index": 1,
      "finish_reason": null
    }
  ],
  "created": 1750820088,
  "model": "Qwen/Qwen2.5-7B",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 58,
    "total_tokens": 0,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

copy-pr-bot bot commented Jun 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Jun 24, 2025

Walkthrough

This update refactors prompt preprocessing to explicitly handle both text and tokenized prompt inputs, adds support for batch token IDs, and extends trait and request implementations to distinguish and extract token-based inputs. Additionally, it modifies SSE event filtering for LLM metric annotations to depend on the DYN_RICH_EVENT_STREAM environment variable.
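As a rough sketch of that gating, assuming DYN_RICH_EVENT_STREAM is read as a boolean flag (the variable name comes from the summary; the accepted values and placement are assumptions, not the PR's exact check):

fn rich_event_stream_enabled() -> bool {
    // Treating "1" or "true" as enabled is illustrative only.
    std::env::var("DYN_RICH_EVENT_STREAM")
        .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}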

Changes

File(s) and change summary:

  • lib/llm/src/http/service/openai.rs: SSE event filtering for LLM metric annotations now depends on the DYN_RICH_EVENT_STREAM environment variable.
  • lib/llm/src/preprocessor.rs: Refactored preprocessing logic to explicitly handle text vs. tokenized prompt inputs and set token IDs accordingly.
  • lib/llm/src/preprocessor/prompt.rs: Added TokenInput and PromptInput enums; extended the OAIChatLikeRequest trait with input-type and token-extraction methods.
  • lib/llm/src/preprocessor/prompt/template/oai.rs: Implemented the new trait methods for NvCreateCompletionRequest to classify and extract token inputs from prompts.
  • lib/llm/src/protocols/common/preprocessor.rs: Added an optional batch_token_ids field to the PreprocessedRequest struct for batch token input support (a sketch follows this list).
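A plausible shape for that batch_token_ids addition, using the field and type names from the summary above; the surrounding fields and exact types are assumptions, not the PR's code:

pub struct PreprocessedRequest {
    /// Token IDs for a single (non-batch) prompt.
    pub token_ids: Vec<u32>,
    /// New: per-prompt token IDs when the request carries a batch.
    pub batch_token_ids: Option<Vec<Vec<u32>>>,
    // ...other fields elided
}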

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant OpenAIPreprocessor
    participant Request
    participant PreprocessedRequestBuilder

    Client->>OpenAIPreprocessor: preprocess_request(Request)
    OpenAIPreprocessor->>Request: prompt_input_type()
    alt Prompt is Tokens
        OpenAIPreprocessor->>Request: extract_tokens()
        OpenAIPreprocessor->>PreprocessedRequestBuilder: set token_ids or batch_token_ids
    else Prompt is Text
        OpenAIPreprocessor->>Request: get raw or formatted prompt
        OpenAIPreprocessor->>OpenAIPreprocessor: tokenize prompt
        OpenAIPreprocessor->>PreprocessedRequestBuilder: set token_ids
    end
    OpenAIPreprocessor->>PreprocessedRequestBuilder: set sampling_options, annotations
    PreprocessedRequestBuilder->>Client: PreprocessedRequest

Poem

In fields of code where prompts may roam,
Now tokens march or text may comb.
Batch or single, all inputs shine,
With metrics streaming down the line.
A bunny hops through preprocess land—
Richer prompts now close at hand!
🐇✨



@ishandhanani (Contributor, Author) commented

There are some Dockerfile changes here that I'm using to test. I'll remove them before merging this PR; they belong in #1583.

@paulhendricks (Member) left a comment

It might be nice to pull out some of the internals of OpenAIPreprocessor so we can add test coverage in the module for different edge cases, e.g. "", ["", ""], [], [[]], instead of relying on the e2e curl scripts.

Overall approving, looks good!

@ishandhanani enabled auto-merge (squash) June 25, 2025 19:43